This report explores the idea of determining wine quality based on some chemical properties. Selecting a good wine can be challenging, and I imagine making good wines is even harder. It would be nice to isolate the different chemical properties inherent to excellent wines; perhaps knowing this would be beneficial in wine making.
The wine quality dataset collected by P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis consists of measurements of some of the chemical properties of wines. Each observation includes a quality score from 0 (very bad) to 10 (very excellent); these scores are based on sensory data. First, I’ll explore the distributions of each property and then observe the relationships among the chemical properties and their relationships with quality. Let’s see if we can identify good wines based on these measurements.
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
According to the dataset documentation, quality is a score between 0 and 10, with 0 being very poor and 10 being very excellent. The median quality is 6, and the mean (5.878) rounds to 6. Since the observed scores range from 3 to 9, I’m going to use the following interpretation.
0 - seriously?!
1 - very poor
2 - poor
3 - very bad
4 - bad
5 - below average
6 - average
7 - above average
8 - good
9 - excellent
10 - very excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
This is the distribution of density, measured in grams per cubic centimeter. There are some clear outliers here; they have been removed in the second graph.
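The report does not state how the outliers were removed for the second graph. One common approach, assumed here purely for illustration, is to drop values above a high percentile. The sketch below applies a 99th-percentile cutoff to a small vector built from the six summary values printed above; the cutoff and the variable names are assumptions, not the method used in this report.

```r
# Toy vector: the six summary values of density shown above
# (min, 1st quartile, median, mean, 3rd quartile, max).
density <- c(0.9871, 0.9917, 0.9937, 0.9940, 0.9961, 1.0390)

# Assumed rule: keep only values below the 99th percentile.
cutoff  <- quantile(density, 0.99)
trimmed <- density[density < cutoff]

print(cutoff)
print(trimmed)
```

On the full dataset the same two lines would be applied to the density column before plotting.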
Above we have the distributions of fixed acidity, volatile acidity, and citric acid, all measured in grams per liter. Each of these has clear outliers in its right tail. These could signify very excellent or very poor wines. We’ll observe this later.
This is the distribution of pH. Its spread appears to be smaller than that of the acidity measurements, as there are no clear outliers in the right tail.
The distributions of chlorides, free SO2, total SO2, and sulfates have some outliers in their right tails also. There’s little variance in the alcohol measurements.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
##
## 0.6 0.7 0.8 0.9 0.95 1 1.05 1.1 1.15 1.2 1.25 1.3
## 2 7 25 39 4 93 1 146 3 187 3 147
## 1.35 1.4 1.45 1.5 1.55 1.6 1.65 1.7 1.75 1.8 1.85 1.9
## 2 184 4 142 2 165 2 99 1 99 3 59
## 1.95 2 2.05 2.1 2.2 2.25 2.3 2.35 2.4 2.5 2.6 2.65
## 2 79 1 51 56 2 42 1 41 40 33 1
## 2.7 2.8 2.85 2.9 3 3.1 3.15 3.2 3.3 3.4 3.5 3.6
## 38 36 1 25 17 17 1 28 23 13 31 22
## 3.7 3.75 3.8 3.85 3.9 3.95 4 4.1 4.2 4.25 4.3 4.35
## 12 2 21 3 17 3 19 17 31 2 19 1
## 4.4 4.45 4.5 4.55 4.6 4.7 4.75 4.8 4.85 4.9 5 5.1
## 14 3 33 2 40 29 5 38 1 35 43 28
## 5.15 5.2 5.25 5.3 5.35 5.4 5.45 5.5 5.55 5.6 5.7 5.8
## 2 29 4 17 2 23 2 13 1 16 30 23
## 5.85 5.9 5.95 6 6.1 6.2 6.3 6.35 6.4 6.5 6.55 6.6
## 2 19 1 23 21 31 39 1 34 26 1 30
## 6.65 6.7 6.75 6.8 6.85 6.9 6.95 7 7.05 7.1 7.2 7.25
## 3 25 1 28 6 20 1 31 2 36 29 2
## 7.3 7.35 7.4 7.45 7.5 7.6 7.7 7.75 7.8 7.85 7.9 7.95
## 19 2 40 1 30 29 34 2 41 1 32 1
## 8 8.1 8.15 8.2 8.25 8.3 8.4 8.45 8.5 8.55 8.6 8.65
## 32 34 1 36 2 31 13 1 24 1 27 1
## 8.7 8.75 8.8 8.9 8.95 9 9.05 9.1 9.15 9.2 9.25 9.3
## 18 2 22 23 1 18 1 17 2 22 2 11
## 9.4 9.5 9.55 9.6 9.65 9.7 9.8 9.85 9.9 10 10.05 10.1
## 10 9 1 18 4 22 16 3 18 18 3 14
## 10.2 10.3 10.4 10.5 10.55 10.6 10.65 10.7 10.8 10.9 11 11.1
## 23 16 25 16 1 22 1 26 17 11 19 18
## 11.2 11.25 11.3 11.4 11.45 11.5 11.6 11.7 11.75 11.8 11.9 11.95
## 18 2 12 14 1 11 15 8 4 35 16 3
## 12 12.05 12.1 12.15 12.2 12.3 12.4 12.5 12.55 12.6 12.7 12.75
## 16 1 21 4 15 13 19 16 2 16 16 1
## 12.8 12.85 12.9 13 13.1 13.15 13.2 13.3 13.4 13.5 13.55 13.6
## 25 4 25 19 23 1 13 16 7 10 3 12
## 13.65 13.7 13.8 13.9 14 14.05 14.1 14.15 14.2 14.3 14.35 14.4
## 4 21 8 18 16 1 4 1 20 17 3 17
## 14.45 14.5 14.55 14.6 14.7 14.75 14.8 14.9 14.95 15 15.1 15.15
## 3 17 3 13 14 2 12 14 2 13 7 1
## 15.2 15.25 15.3 15.4 15.5 15.55 15.6 15.7 15.75 15.8 15.9 16
## 6 1 9 17 11 6 14 9 1 6 2 10
## 16.05 16.1 16.2 16.3 16.4 16.45 16.5 16.55 16.6 16.65 16.7 16.75
## 6 2 7 7 5 1 3 1 2 5 5 2
## 16.8 16.85 16.9 16.95 17 17.05 17.1 17.2 17.3 17.35 17.4 17.45
## 4 4 3 3 1 1 5 9 14 1 2 2
## 17.5 17.55 17.6 17.7 17.75 17.8 17.85 17.9 17.95 18 18.05 18.1
## 8 3 2 1 4 13 5 2 3 2 3 6
## 18.15 18.2 18.3 18.35 18.4 18.5 18.6 18.75 18.8 18.9 18.95 19.1
## 8 3 2 4 1 1 1 4 3 1 3 1
## 19.25 19.3 19.35 19.4 19.45 19.5 19.6 19.8 19.9 19.95 20.15 20.2
## 3 4 1 2 3 2 1 4 1 3 1 2
## 20.3 20.4 20.7 20.8 22 22.6 23.5 26.05 31.6 65.8
## 1 1 2 2 2 1 1 2 2 1
Let’s take a closer look at residual sugar. Its distribution might be bi-modal.
The information provided with the dataset describes acidity as either fixed or volatile. It also states that the fixed acidity is attributed to the amount of tartaric acid in the wine, and volatile acidity is attributed to the amount of acetic acid. Citric acid is another attribute of the data but is not denoted as fixed or volatile. Because of the specificity of the fixed acidity attribute, I’ll assume citric acid is another form of fixed acidity.
Free sulfur dioxide is a component of the total sulfur dioxide. Let’s create two new variables. One will represent the ratio of free sulfur dioxide to total sulfur dioxide. The other will represent the total acidity; this will be the sum of the fixed (tartaric), citric, and volatile (acetic) acids.
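The two derived variables described above can be sketched as follows. The rows are the first five observations copied from the `str()` output earlier in the report, and the column names `sulfur.dioxide.ratio` and `total.acidity` match those in the correlation output later on; the data frame name `wines` is taken from the grouped summaries. This is a minimal sketch, not necessarily the exact code used.

```r
# First five observations, copied from the str() output above.
wines <- data.frame(
  fixed.acidity        = c(7, 6.3, 8.1, 7.2, 7.2),
  volatile.acidity     = c(0.27, 0.3, 0.28, 0.23, 0.23),
  citric.acid          = c(0.36, 0.34, 0.4, 0.32, 0.32),
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186)
)

# Share of the total SO2 that is free (unitless, between 0 and 1).
wines$sulfur.dioxide.ratio <-
  wines$free.sulfur.dioxide / wines$total.sulfur.dioxide

# Sum of the three acid measurements, all in g/L.
wines$total.acidity <-
  wines$fixed.acidity + wines$citric.acid + wines$volatile.acidity

print(wines[, c("sulfur.dioxide.ratio", "total.acidity")])
```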
The SO2 ratio appears to be centered around 0.26 and the total acidity near 7.6 g/L.
The white wine dataset consists of 4898 observations of 12 variables (alcohol, chlorides, citric acid, density, fixed acidity, free sulfur dioxide, pH, residual sugar, sulfates, total sulfur dioxide, volatile acidity, and quality), plus a row index X. I have added two more variables (sulfur dioxide ratio and total acidity). All variables except quality are continuous. Quality is a discrete, ordered score that can range from 0 to 10; in this dataset it ranges from 3 to 9.
I would like to know which chemical properties influence the quality of white wines.
The distribution of citric acid looks similar to that of density. I’m curious to see if these two variables are correlated. Density, citric acid, chlorides, free sulfur dioxide, and residual sugar also had clear outliers. I would like to see if these outliers are related to wine quality.
I created the variable sulfur dioxide ratio; it is the value of the free sulfur dioxide divided by the total sulfur dioxide. I also created the variable total acidity. It is the sum of the fixed acidity, citric acid, and volatile acidity.
The distribution of residual sugar looks as though there may be two groups to consider. Some outliers were cut off to get a better look at the distribution.
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.02269729 0.289180698
## volatile.acidity -0.02269729 1.00000000 -0.149471811
## citric.acid 0.28918070 -0.14947181 1.000000000
## residual.sugar 0.08902070 0.06428606 0.094211624
## chlorides 0.02308564 0.07051157 0.114364448
## free.sulfur.dioxide -0.04939586 -0.09701194 0.094077221
## total.sulfur.dioxide 0.09106976 0.08926050 0.121130798
## density 0.26533101 0.02711385 0.149502571
## pH -0.42585829 -0.03191537 -0.163748211
## sulphates -0.01714299 -0.03572815 0.062330940
## alcohol -0.12088112 0.06771794 -0.075728730
## quality -0.11366283 -0.19472297 -0.009209091
## sulfur.dioxide.ratio -0.13945918 -0.19616085 0.016241396
## total.acidity 0.98717874 0.07157062 0.394143356
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.08902070 0.02308564 -0.0493958591
## volatile.acidity 0.06428606 0.07051157 -0.0970119393
## citric.acid 0.09421162 0.11436445 0.0940772210
## residual.sugar 1.00000000 0.08868454 0.2990983537
## chlorides 0.08868454 1.00000000 0.1013923521
## free.sulfur.dioxide 0.29909835 0.10139235 1.0000000000
## total.sulfur.dioxide 0.40143931 0.19891030 0.6155009650
## density 0.83896645 0.25721132 0.2942104109
## pH -0.19413345 -0.09043946 -0.0006177961
## sulphates -0.02666437 0.01676288 0.0592172458
## alcohol -0.45063122 -0.36018871 -0.2501039415
## quality -0.09757683 -0.20993441 0.0081580671
## sulfur.dioxide.ratio 0.05142979 -0.03321768 0.7386321024
## total.acidity 0.10473749 0.04552987 -0.0451333172
## total.sulfur.dioxide density pH
## fixed.acidity 0.091069756 0.26533101 -0.4258582910
## volatile.acidity 0.089260504 0.02711385 -0.0319153683
## citric.acid 0.121130798 0.14950257 -0.1637482114
## residual.sugar 0.401439311 0.83896645 -0.1941334540
## chlorides 0.198910300 0.25721132 -0.0904394560
## free.sulfur.dioxide 0.615500965 0.29421041 -0.0006177961
## total.sulfur.dioxide 1.000000000 0.52988132 0.0023209718
## density 0.529881324 1.00000000 -0.0935914935
## pH 0.002320972 -0.09359149 1.0000000000
## sulphates 0.134562367 0.07449315 0.1559514973
## alcohol -0.448892102 -0.78013762 0.1214320987
## quality -0.174737218 -0.30712331 0.0994272457
## sulfur.dioxide.ratio -0.013447850 -0.06552475 0.0008012900
## total.acidity 0.113188502 0.27560881 -0.4306513315
## sulphates alcohol quality
## fixed.acidity -0.01714299 -0.12088112 -0.113662831
## volatile.acidity -0.03572815 0.06771794 -0.194722969
## citric.acid 0.06233094 -0.07572873 -0.009209091
## residual.sugar -0.02666437 -0.45063122 -0.097576829
## chlorides 0.01676288 -0.36018871 -0.209934411
## free.sulfur.dioxide 0.05921725 -0.25010394 0.008158067
## total.sulfur.dioxide 0.13456237 -0.44889210 -0.174737218
## density 0.07449315 -0.78013762 -0.307123313
## pH 0.15595150 0.12143210 0.099427246
## sulphates 1.00000000 -0.01743277 0.053677877
## alcohol -0.01743277 1.00000000 0.435574715
## quality 0.05367788 0.43557472 1.000000000
## sulfur.dioxide.ratio -0.02236186 0.06446642 0.197214077
## total.acidity -0.01185225 -0.11751272 -0.131377207
## sulfur.dioxide.ratio total.acidity
## fixed.acidity -0.13945918 0.98717874
## volatile.acidity -0.19616085 0.07157062
## citric.acid 0.01624140 0.39414336
## residual.sugar 0.05142979 0.10473749
## chlorides -0.03321768 0.04552987
## free.sulfur.dioxide 0.73863210 -0.04513332
## total.sulfur.dioxide -0.01344785 0.11318850
## density -0.06552475 0.27560881
## pH 0.00080129 -0.43065133
## sulphates -0.02236186 -0.01185225
## alcohol 0.06446642 -0.11751272
## quality 0.19721408 -0.13137721
## sulfur.dioxide.ratio 1.00000000 -0.15258717
## total.acidity -0.15258717 1.00000000
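A matrix like the one above is what `cor()` returns for a data frame of numeric columns. The sketch below uses only the first ten observations of three variables, copied from the `str()` output, so its values will not match the full-data correlations; it is meant to show the call and how to read one entry. Quality is left out because it is constant (6) in those ten rows and would produce `NA`.

```r
# First ten observations of three variables, copied from the str() output.
wines <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  pH             = c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
)

# Pearson correlation between every pair of columns.
corr <- cor(wines, method = "pearson")
print(round(corr, 2))

# A single entry can be looked up by row and column name.
print(corr["residual.sugar", "alcohol"])
```

Even in this small sample, residual sugar and alcohol are negatively correlated, the same sign as in the full matrix.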
Quality is not strongly correlated with any single property in this scatterplot matrix. Its strongest correlations are with alcohol (0.44) and density (-0.31), both well below the highest correlations among the other properties. The plots of quality versus the other properties do not show a clear dependency on any particular property, so I want to observe the distribution of each chemical property by quality score.
It appears as alcohol content increases so does the quality score; however the bad wines have more alcohol than the below average quality wines. I’m less concerned with the actual quality score; I’m more interested in whether a wine is good or bad. From now on, I’ll use good, bad, or average to denote wine quality.
## bad below avg average above avg excellent
## 183 1457 2198 880 180
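The grouped summaries later in the report show that the bad group holds scores 3 and 4, the excellent group holds scores 8 and 9, and the middle groups hold a single score each. One way to build such a grouping is `cut()`; the break points below are inferred from those summaries, and the name `quality.group` is hypothetical.

```r
# One example of each quality score observed in the dataset.
quality <- c(3, 4, 5, 6, 7, 8, 9)

# Intervals are closed on the right: (0,4], (4,5], (5,6], (6,7], (7,10].
quality.group <- cut(
  quality,
  breaks = c(0, 4, 5, 6, 7, 10),
  labels = c("bad", "below avg", "average", "above avg", "excellent"),
  include.lowest = TRUE
)

# Count the observations per group, as in the table above.
print(table(quality.group))
```

Because `cut()` returns a factor with ordered levels, later plots and summaries list the groups from bad to excellent rather than alphabetically.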
There are a lot more below average wines (1457) than above average wines (880). However, the numbers of excellent and bad wine observations are almost the same (180 and 183). It should be interesting to see the difference between the measurements for these two groups.
There is a difference of about 1.5 percentage points between the mean alcohol content of bad and excellent wines (10.17% versus 11.65%); the medians differ by almost 2 points (10.1% versus 12.0%).
I want to take a closer look in order to find a line of demarcation between good and bad white wines.
The mean chloride level of bad wines is greater than the level of excellent wines. Higher levels of sodium chloride could represent a more bitter taste in the wine.
The density of wine depends on the alcohol and sugar content, and I can see the shape of the plot is almost the inverse of the shape in the plot of alcohol and quality.
There are no major differences in the means of fixed acidity levels among the groups. Fixed acidity does not appear to be a major factor in wine quality.
There is a separation between the means of the citric acid levels in the excellent wine and bad wine groups, but not much difference between excellent wines and below average wines. The amount of citric acid could mean the difference in personal preferences when it comes to wine tastes.
Volatile acidity is the level of acetic acid in wine, and too much acetic acid is a bad thing. Mean levels of roughly 300 mg/L or more appear only in the bad and below average groups.
There appears to be a relationship between total acidity and quality. Because there was no clear relationship between fixed acidity or citric acid and quality, I assume the relationship between total acidity and quality is due to the levels of volatile acidity.
The mean pH rises slightly as wine quality increases, from 3.18 for bad wines to 3.22 for excellent wines; that is, the better wines are marginally less acidic.
There is a difference of around 17 mg/L between the median free SO2 levels of bad wines and excellent wines (18 versus 34.5 mg/L); the means differ by 10 mg/L.
The median SO2 ratio increases from around 0.16 to 0.29 as wine quality increases from bad to excellent; the mean rises from 0.19 to 0.29.
The mean level of potassium sulfate, represented by the variable sulphates, is almost the same across each quality level.
Mean residual sugar levels vary across each quality group. There is no clear relationship. I am also interested in the relationship of some of the variables with higher correlations.
I can imagine a negatively-sloped line running through each scatterplot above. These would fit with the computed correlations from above.
Potassium sulfate (represented by the sulphates variable) is an additive that contributes to the sulfur dioxide levels in the wine. I want to examine that relationship.
I can imagine a positively-sloped line running through the relationship between sulfates and total SO2. However, the correlation between the two is low (0.13).
## wines[, 16]: bad
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.200 Min. :0.110 Min. :0.0000 Min. : 0.700
## 1st Qu.: 6.400 1st Qu.:0.260 1st Qu.:0.2050 1st Qu.: 1.350
## Median : 6.900 Median :0.320 Median :0.3000 Median : 2.700
## Mean : 7.181 Mean :0.376 Mean :0.3077 Mean : 4.821
## 3rd Qu.: 7.650 3rd Qu.:0.460 3rd Qu.:0.4000 3rd Qu.: 7.500
## Max. :11.800 Max. :1.100 Max. :0.8800 Max. :17.550
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01300 Min. : 3.00 Min. : 10.0
## 1st Qu.:0.03750 1st Qu.: 9.00 1st Qu.: 85.5
## Median :0.04600 Median : 18.00 Median :119.0
## Mean :0.05056 Mean : 26.63 Mean :130.2
## 3rd Qu.:0.05400 3rd Qu.: 33.50 3rd Qu.:177.0
## Max. :0.29000 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9892 Min. :2.830 Min. :0.250 Min. : 8.00
## 1st Qu.:0.9926 1st Qu.:3.060 1st Qu.:0.380 1st Qu.: 9.40
## Median :0.9941 Median :3.160 Median :0.470 Median :10.10
## Mean :0.9943 Mean :3.183 Mean :0.476 Mean :10.17
## 3rd Qu.:0.9960 3rd Qu.:3.285 3rd Qu.:0.540 3rd Qu.:10.80
## Max. :1.0004 Max. :3.720 Max. :0.870 Max. :13.50
## quality sulfur.dioxide.ratio total.acidity
## Min. :3.000 Min. :0.03371 Min. : 4.645
## 1st Qu.:4.000 1st Qu.:0.10543 1st Qu.: 7.020
## Median :4.000 Median :0.16129 Median : 7.630
## Mean :3.891 Mean :0.18883 Mean : 7.865
## 3rd Qu.:4.000 3rd Qu.:0.23852 3rd Qu.: 8.450
## Max. :4.000 Max. :0.65682 Max. :12.410
## --------------------------------------------------------
## wines[, 16]: below avg
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.500 Min. :0.100 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.400 1st Qu.:0.240 1st Qu.:0.2400 1st Qu.: 1.800
## Median : 6.800 Median :0.280 Median :0.3200 Median : 7.000
## Mean : 6.934 Mean :0.302 Mean :0.3377 Mean : 7.335
## 3rd Qu.: 7.400 3rd Qu.:0.340 3rd Qu.:0.4100 3rd Qu.:11.500
## Max. :10.300 Max. :0.905 Max. :1.0000 Max. :23.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.04000 1st Qu.: 22.00 1st Qu.:121.0
## Median :0.04700 Median : 35.00 Median :151.0
## Mean :0.05155 Mean : 36.43 Mean :150.9
## 3rd Qu.:0.05300 3rd Qu.: 50.00 3rd Qu.:182.0
## Max. :0.34600 Max. :131.00 Max. :344.0
## density pH sulphates alcohol
## Min. :0.9872 Min. :2.790 Min. :0.2700 Min. : 8.000
## 1st Qu.:0.9933 1st Qu.:3.080 1st Qu.:0.4200 1st Qu.: 9.200
## Median :0.9953 Median :3.160 Median :0.4700 Median : 9.500
## Mean :0.9953 Mean :3.169 Mean :0.4822 Mean : 9.809
## 3rd Qu.:0.9972 3rd Qu.:3.240 3rd Qu.:0.5300 3rd Qu.:10.300
## Max. :1.0024 Max. :3.790 Max. :0.8800 Max. :13.600
## quality sulfur.dioxide.ratio total.acidity
## Min. :5 Min. :0.02362 Min. : 4.900
## 1st Qu.:5 1st Qu.:0.17188 1st Qu.: 6.970
## Median :5 Median :0.23810 Median : 7.500
## Mean :5 Mean :0.23772 Mean : 7.574
## 3rd Qu.:5 3rd Qu.:0.29646 3rd Qu.: 8.120
## Max. :5 Max. :0.65000 Max. :11.030
## --------------------------------------------------------
## wines[, 16]: average
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.000 Min. : 0.700
## 1st Qu.: 6.300 1st Qu.:0.2000 1st Qu.:0.270 1st Qu.: 1.700
## Median : 6.800 Median :0.2500 Median :0.320 Median : 5.300
## Mean : 6.838 Mean :0.2606 Mean :0.338 Mean : 6.442
## 3rd Qu.: 7.300 3rd Qu.:0.3000 3rd Qu.:0.380 3rd Qu.: 9.900
## Max. :14.200 Max. :0.9650 Max. :1.660 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01500 Min. : 3.00 Min. : 18.0
## 1st Qu.:0.03600 1st Qu.: 24.00 1st Qu.:107.2
## Median :0.04300 Median : 34.00 Median :132.0
## Mean :0.04522 Mean : 35.65 Mean :137.0
## 3rd Qu.:0.04900 3rd Qu.: 46.00 3rd Qu.:164.0
## Max. :0.25500 Max. :112.00 Max. :294.0
## density pH sulphates alcohol
## Min. :0.9876 Min. :2.720 Min. :0.2300 Min. : 8.50
## 1st Qu.:0.9917 1st Qu.:3.080 1st Qu.:0.4100 1st Qu.: 9.60
## Median :0.9937 Median :3.180 Median :0.4800 Median :10.50
## Mean :0.9940 Mean :3.189 Mean :0.4911 Mean :10.58
## 3rd Qu.:0.9959 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.810 Max. :1.0600 Max. :14.00
## quality sulfur.dioxide.ratio total.acidity
## Min. :6 Min. :0.03361 Min. : 4.130
## 1st Qu.:6 1st Qu.:0.19836 1st Qu.: 6.860
## Median :6 Median :0.25862 Median : 7.370
## Mean :6 Mean :0.26217 Mean : 7.436
## 3rd Qu.:6 3rd Qu.:0.32046 3rd Qu.: 7.940
## Max. :6 Max. :0.71053 Max. :14.960
## --------------------------------------------------------
## wines[, 16]: above avg
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. :4.200 Min. :0.0800 Min. :0.0100 Min. : 0.900
## 1st Qu.:6.200 1st Qu.:0.1900 1st Qu.:0.2800 1st Qu.: 1.700
## Median :6.700 Median :0.2500 Median :0.3100 Median : 3.650
## Mean :6.735 Mean :0.2628 Mean :0.3256 Mean : 5.186
## 3rd Qu.:7.200 3rd Qu.:0.3200 3rd Qu.:0.3600 3rd Qu.: 7.325
## Max. :9.200 Max. :0.7600 Max. :0.7400 Max. :19.250
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 5.00 Min. : 34.0
## 1st Qu.:0.03100 1st Qu.: 25.00 1st Qu.:101.0
## Median :0.03700 Median : 33.00 Median :122.0
## Mean :0.03819 Mean : 34.13 Mean :125.1
## 3rd Qu.:0.04400 3rd Qu.: 41.00 3rd Qu.:144.2
## Max. :0.13500 Max. :108.00 Max. :229.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.840 Min. :0.2200 Min. : 8.60
## 1st Qu.:0.9906 1st Qu.:3.100 1st Qu.:0.4100 1st Qu.:10.60
## Median :0.9918 Median :3.200 Median :0.4800 Median :11.40
## Mean :0.9925 Mean :3.214 Mean :0.5031 Mean :11.37
## 3rd Qu.:0.9937 3rd Qu.:3.320 3rd Qu.:0.5800 3rd Qu.:12.30
## Max. :1.0004 Max. :3.820 Max. :1.0800 Max. :14.20
## quality sulfur.dioxide.ratio total.acidity
## Min. :7 Min. :0.0500 Min. :4.730
## 1st Qu.:7 1st Qu.:0.2118 1st Qu.:6.810
## Median :7 Median :0.2717 Median :7.310
## Mean :7 Mean :0.2757 Mean :7.323
## 3rd Qu.:7 3rd Qu.:0.3333 3rd Qu.:7.820
## Max. :7 Max. :0.6429 Max. :9.870
## --------------------------------------------------------
## wines[, 16]: excellent
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. :3.900 Min. :0.120 Min. :0.0400 Min. : 0.800
## 1st Qu.:6.200 1st Qu.:0.200 1st Qu.:0.2800 1st Qu.: 2.075
## Median :6.800 Median :0.260 Median :0.3200 Median : 4.300
## Mean :6.678 Mean :0.278 Mean :0.3282 Mean : 5.628
## 3rd Qu.:7.300 3rd Qu.:0.330 3rd Qu.:0.3600 3rd Qu.: 8.150
## Max. :9.100 Max. :0.660 Max. :0.7400 Max. :14.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01400 Min. : 6.00 Min. : 59.0
## 1st Qu.:0.03000 1st Qu.: 28.00 1st Qu.:102.8
## Median :0.03550 Median : 34.50 Median :122.0
## Mean :0.03801 Mean : 36.63 Mean :125.9
## 3rd Qu.:0.04400 3rd Qu.: 44.25 3rd Qu.:148.5
## Max. :0.12100 Max. :105.00 Max. :212.5
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.940 Min. :0.2500 Min. : 8.50
## 1st Qu.:0.9903 1st Qu.:3.127 1st Qu.:0.3800 1st Qu.:11.00
## Median :0.9916 Median :3.230 Median :0.4600 Median :12.00
## Mean :0.9922 Mean :3.221 Mean :0.4857 Mean :11.65
## 3rd Qu.:0.9935 3rd Qu.:3.330 3rd Qu.:0.5825 3rd Qu.:12.60
## Max. :1.0006 Max. :3.590 Max. :0.9500 Max. :14.00
## quality sulfur.dioxide.ratio total.acidity
## Min. :8.000 Min. :0.07895 Min. :4.525
## 1st Qu.:8.000 1st Qu.:0.22308 1st Qu.:6.855
## Median :8.000 Median :0.28767 Median :7.385
## Mean :8.028 Mean :0.28929 Mean :7.284
## 3rd Qu.:8.000 3rd Qu.:0.33621 3rd Qu.:7.768
## Max. :9.000 Max. :0.60377 Max. :9.820
We can compare the statistical values of each variable according to the wine quality and notice values that differentiate excellent wines from poor wines.
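The `wines[, 16]: ...` headers in the output above are the form printed by `by()` when a data frame is split on its 16th column and `summary()` is applied to each piece. The sketch below reproduces that pattern on a few toy rows (made up for illustration, not taken from the dataset) and adds a more compact per-group median table with `aggregate()`; the column name `quality.group` is hypothetical.

```r
# Toy rows (made up for illustration, not taken from the dataset).
wines <- data.frame(
  alcohol              = c(9.4, 10.1, 10.8, 11.0, 12.0, 12.6),
  sulfur.dioxide.ratio = c(0.10, 0.16, 0.24, 0.22, 0.29, 0.34),
  quality.group        = factor(
    c("bad", "bad", "bad", "excellent", "excellent", "excellent"),
    levels = c("bad", "excellent")
  )
)

# Full summary() of each numeric column, one block per group.
print(by(wines[, c("alcohol", "sulfur.dioxide.ratio")],
         wines$quality.group, summary))

# Compact alternative: one median per group and variable.
medians <- aggregate(
  cbind(alcohol, sulfur.dioxide.ratio) ~ quality.group,
  data = wines, FUN = median
)
print(medians)
```

The compact table makes it easier to scan for the variables whose medians separate the groups most clearly.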
I was curious about the relationships between sulfates and the sulfur dioxide levels. I was surprised there was not a more evident relationship. I also observed the correlation between sugar and alcohol, pH and total acidity, and alcohol and total sulfur dioxide. The correlations computed earlier were evident in the graphs.
Some of these relations are expected since density depends on the alcohol and sugar content in wine. The acidity and basicity are described by the pH of wine. Alcohol and quality have a correlation of 0.44; good wines tend to have more alcohol than bad wines.
The first plot shows how the different combinations of alcohol and volatile acidity levels contribute to wine quality. The second plot shows the location of each cluster.
This shows the location of the alcohol-volatile acidity clusters in relation to the SO2 ratio.
This graph shows the clusters in the relationship between chlorides and residual sugar. The observations are colored by alcohol content, which helps to distinguish high and low quality wines.
This graph shows the relationship of alcohol, free SO2, and the SO2 ratio by quality. Excellent wines are clustering around 12% alcohol and a free SO2 level around 30 mg/L.
Alcohol seems to be the better chemical property for distinguishing cluster centers. Chloride levels of around 40 mg/L and 12% alcohol are present in excellent wines.
There’s no clear division in this graph.
I know alcohol content around 12% is a good indicator of good quality wines. The red observations in the bad and below average plots indicate that combinations of levels less than 0.03 g/L of chlorides and less than 20 mg/L of free SO2 are not good for wines.
Alcohol strengthens many of the variables when looking at wine quality. Quality clusters by alcohol content. The plots in the Multivariate section show clusters of wine quality for different combinations of alcohol content, chlorides, free SO2, SO2 ratio, and volatile acidity.
Low chlorides and high alcohol interact well when using residual sugar to differentiate high quality wines. In the Bivariate section, wine quality tended to increase as the sulfur dioxide ratio rose towards 40%. This effect was less evident when the ratio was plotted against alcohol and chlorides.
Wines with the highest quality have the highest median alcohol content, while lower quality wines have lower median alcohol content. The median alcohol content by volume is 10.1, 9.5, 10.5, 11.4, and 12.0 percent for bad, below average, average, above average, and excellent wines, respectively. That leaves a margin of 1.9 percentage points between bad wines and excellent wines.
Wines with the highest quality have the highest median SO2 ratio as opposed to lower quality wines which have lower median SO2 ratio. The median SO2 ratio is 0.16, 0.24, 0.26, 0.27, and 0.29 respectively for bad, below average, average, above average, and excellent wines. As the wine quality increases, the median SO2 ratio increases also. A linear model could be constructed to predict the quality of wines using the ratio of free SO2 to total SO2.
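A minimal sketch of such a model, fitted with `lm()` on a handful of toy observations (made up for illustration, not taken from the dataset). It treats the quality score as a numeric response, which is an assumption; the names `toy`, `fit`, and `new.wine` are hypothetical.

```r
# Toy observations (made up for illustration, not taken from the dataset).
toy <- data.frame(
  sulfur.dioxide.ratio = c(0.10, 0.16, 0.24, 0.26, 0.27, 0.29, 0.34),
  quality              = c(3, 4, 5, 6, 7, 8, 8)
)

# Ordinary least squares: quality = intercept + slope * ratio.
fit <- lm(quality ~ sulfur.dioxide.ratio, data = toy)
print(coef(fit))

# Predicted score for a wine with a free-to-total SO2 ratio of 0.29.
new.wine  <- data.frame(sulfur.dioxide.ratio = 0.29)
predicted <- predict(fit, newdata = new.wine)
print(predicted)

# Share of the variance in quality explained by the single predictor.
print(summary(fit)$r.squared)
```

On the full dataset the fit would be much weaker than on these toy rows: the correlation between quality and the SO2 ratio reported earlier is about 0.20, so a model with this single predictor would explain only about 4% of the variance in quality.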
High quality wines and bad wines are clustering in two different regions. The mean alcohol and volatile acidity levels in excellent wines are 11.65% and 278 mg/L, respectively. Excellent wines are clustered above 11% alcohol and near 300 mg/L of volatile acidity (acetic acid). These wines are also more likely to have higher SO2 ratios; the mean SO2 ratio is 0.29. The alcohol content in bad wines is centered around the mean of 10.17% by volume; the mean SO2 ratio is 0.19. The mean volatile acidity (acetic acid) level in bad wines is 376 mg/L. Bad wines are clustered below 11% alcohol and are more likely to have higher levels of volatile acidity and lower SO2 ratios.
The white wines dataset contains information for almost 5,000 wines. After investigating the distributions of the individual variables and outliers in the data set, I explored the relationships between certain variables using plots. I am interested in discerning which variables affect the quality of white wines. Initially, I was confused by the variables, their names, and the information provided about the dataset. I first created two new variables representative of the relationships among separate variables. Then I observed the quality of wines across each variable. I was disappointed the scatterplot matrix did not show high correlations apart from the variables I created. I wanted to see clear relationships between quality and the other variables; the scatterplots showed a lot of overplotting. Also there were too many quality levels for distinguishing good wines from bad ones so I combined some levels into groups distinguishing bad, average, and good. Plotting the distribution of each variable by the levels of quality showed the relationships I wanted to see. I noticed trends in the effect alcohol, chlorides, volatile acidity, and free SO2 have on wine quality. Also I was surprised that sulfates did not have a higher correlation with free SO2 and the total SO2. And I was curious to see some correlation between alcohol content and the amount of residual sugar.
The multivariate analysis shows that in further exploration a linear model could be fit to the data. Alcohol and chlorides were good indicators of wine quality. Comparing the 12% alcohol level of usually good wines with the other variables showed the negative effect their levels could have on wine quality. For example, the plot of quality vs alcohol, chlorides, and free SO2 shows the effect of too low alcohol content and too high chloride levels on wine quality. There is not a lot of margin in some chemical properties when comparing good and bad wines. The task will be a difficult one, but perhaps white wine quality can be predicted using some of the variables mentioned previously.